This document comprises the manual for the Darwin Harbour sediment monitoring program analysis application. It provides information on:
The R Graphical and Statistical Environment offers an ideal platform for developing and running complex statistical analyses, as well as presenting the outcomes via professional graphical/tabular representations. As a completely scripted language, it also offers the potential for both full transparency and reproducibility. Nevertheless, as the language, and more specifically its extension packages, are community developed and maintained, the environment evolves over time. Similarly, the underlying operating systems and programs on which R and its extension packages depend (hereafter referred to as the operating environment) also change over time. Consequently, the stability and reproducibility of R code tend to degrade over time.
One way to attempt to future-proof a codebase that must run in a potentially unpredictable operating environment is to containerise that environment, so that it is preserved unchanged over time. Containers (specifically docker containers) are lightweight abstraction units that encapsulate applications and their dependencies within standardized, self-contained execution environments. They package application code, runtime, libraries, and system tools into isolated units that abstract away underlying infrastructure differences, enabling consistent and predictable execution across diverse computing platforms.
Containers offer several advantages, such as efficient resource utilization, rapid deployment, and scalability. They enable developers to build, test, and deploy applications with greater speed and flexibility. Docker containers have become a fundamental building block in modern software development, enabling the development and deployment of applications in a consistent and predictable manner across various environments.
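To make the idea concrete, the following is a hedged sketch of what a Dockerfile pinning an R/Shiny operating environment might look like. The base image, tag, system library, and paths are illustrative assumptions, not the actual build instructions used by this application.

```dockerfile
# Illustrative sketch only - image name, tag and paths are assumptions.
# Pinning the tag (here, the R 4.3.1 build of the rocker/shiny image)
# freezes the entire operating environment at a known state.
FROM rocker/shiny:4.3.1

# System libraries that the app's R packages might depend on (example only)
RUN apt-get update \
    && apt-get install -y --no-install-recommends libxml2 \
    && rm -rf /var/lib/apt/lists/*

# Copy the application into the location served by shiny-server
COPY . /srv/shiny-server/app

# shiny-server listens on port 3838 inside the container
EXPOSE 3838
```

Because every instruction (including the base image tag) is fixed, any container built from this file will run the same software versions regardless of when or where it is built.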
Shiny is a web application framework for R that enables the creation of interactive and data-driven web applications directly from R scripts. Developed by RStudio, Shiny simplifies the process of turning analyses into interactive web-based tools without the need for extensive web development expertise.
What makes Shiny particularly valuable is its seamless integration with R, allowing statisticians and data scientists to build and deploy bespoke statistical applications, thereby making data visualization, exploration, and analysis accessible to a broader audience. With its interactive and user-friendly nature, Shiny serves as a powerful tool for sharing insights and engaging stakeholders in a more intuitive and visual manner.
Git, a distributed version control system, and GitHub, a web-based platform for hosting and collaborating on Git repositories, play pivotal roles in enhancing reproducibility and transparency in software development. By tracking changes in source code and providing a centralized platform for collaborative work, Git and GitHub enable developers to maintain a detailed history of code alterations. This history serves as a valuable asset for ensuring the reproducibility of software projects, allowing users to trace and replicate specific versions of the codebase.
GitHub Actions (an integrated workflow automation feature of GitHub) automates tasks such as building, testing, and deploying applications and artifacts. Notably, through workflow actions, GitHub Actions can build docker containers and act as a container registry. This integration enhances the overall transparency of software development workflows, making it easier to share, understand, and reproduce projects collaboratively.
Figure 1 provides a schematic overview of the relationship between the code produced by the developer, the GitHub cloud repository and container registry, and the shiny docker container run by the user.
Dockerfile (instructions used to assemble a full operating environment) and github workflow file (instructions for building and packaging the docker image on github via actions).
Retrieving and running docker containers requires the installation of Docker Desktop on Windows and macOS.
The steps for installing Docker Desktop are:
Download the Installer: head to https://docs.docker.com/desktop/install/windows-install/ and follow the instructions for downloading the appropriate installer for your Windows version (Home or Pro).
Run the Installer: double-click the downloaded file and follow the on-screen instructions from the installation wizard. Accept the license agreement and choose your preferred installation location.
Configure Resources (Optional): Docker Desktop might suggest allocating some system resources like CPU and memory. These settings can be adjusted later, so feel free to use the defaults for now.
Start the Docker Engine: once installed, click the “Start Docker Desktop” button. You may see a notification in the taskbar - click it to confirm and allow Docker to run in the background.
Verification: open a terminal (or PowerShell) and run `docker --version`. If all went well, you should see information about the installed Docker Engine version.
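The verification step above can be scripted. The snippet below is a hedged sketch (not part of the application itself): it checks that the Docker CLI is installed and that the daemon is reachable, and includes a small helper that extracts the version number from the CLI banner (whose format is assumed to be like `Docker version 24.0.7, build afdd53b`).

```shell
# Hypothetical verification helpers - not part of the app's deploy scripts.

# Prints "cli-missing" if docker is not installed, "daemon-down" if the CLI
# exists but Docker Desktop is not running, and "ready" otherwise.
docker_ready() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "cli-missing"
  elif ! docker info >/dev/null 2>&1; then
    echo "daemon-down"
  else
    echo "ready"
  fi
}

# Extracts "24.0.7" from a banner like "Docker version 24.0.7, build afdd53b"
parse_docker_version() {
  echo "$1" | sed -n 's/^Docker version \([0-9.]*\),.*/\1/p'
}
```

If `docker_ready` reports `daemon-down`, start Docker Desktop and try again.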
Additional Tips:
The task of installing and running the app is performed via a single deploy script (deploy.bat on Windows or deploy.sh on Linux/macOS/WSL). For this to work properly, the deploy script should be placed in a folder along with a folder (called input) that contains the input datasets (in Excel format). This structure is illustrated below for Windows.
\
|- deploy.bat
|- input
   |- dataset1.xlsx
   |- dataset2.xlsx
In the above illustration, there are two example datasets (dataset1.xlsx and dataset2.xlsx). The datasets need NOT be called dataset1.xlsx and dataset2.xlsx. They can have any name you choose, so long as they are Excel files that adhere to the structure outlined in Section 4.1.
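A pre-flight check of this layout could be sketched as follows. This is a hypothetical helper, not the actual deploy script: it simply confirms that an input folder exists next to the script and contains at least one .xlsx file before the container is launched.

```shell
# Hypothetical pre-flight check mirroring the folder layout described above.
# Takes the deploy folder as its argument and reports whether the expected
# "input" subfolder with Excel datasets is present.
validate_layout() {
  dir="$1"
  if [ ! -d "$dir/input" ]; then
    echo "missing input folder"
    return 1
  fi
  # Count .xlsx files directly inside input/ (tr strips wc's padding)
  count=$(find "$dir/input" -maxdepth 1 -name '*.xlsx' | wc -l | tr -d ' ')
  if [ "$count" -eq 0 ]; then
    echo "no .xlsx datasets in input folder"
    return 1
  fi
  echo "ok: $count dataset(s) found"
}
```

Running `validate_layout .` from the deploy folder before `deploy.sh` would catch the most common setup mistake (a missing or empty input folder).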
This Shiny application is designed to ingest very specifically structured Excel spreadsheets containing Darwin Harbour sediment monitoring data and produce various analyses and visualisations. The application is served from a docker container to the localhost and the default web browser.
Docker containers can be thought of as computers running within other computers. More specifically, a container runs an instance of an image built using a series of specific instructions that govern the entire software environment. As a result, containers run from the same image will operate (virtually) identically regardless of the host environment. Furthermore, since the build instructions can specify exact versions of all software components, containers provide a way of maximising the chances that an application will continue to run as designed into the future despite changes to operating environments and dependencies.
This shiny application comprises five pages (each accessible via the sidebar menu on the left side of the screen):
Each page will also contain instructions to help guide you through using or interpreting the information. In some cases, this will take the form of an info box (such as the current box). In other cases, it will take the form of little symbols whose content is revealed with a mouse hover.
There are numerous stages throughout the analysis pipeline that may require user review (for example examining the exploratory data analysis figures to confirm that the data are as expected). Consequently, it is necessary for the user to manually trigger each successive stage of the pipeline. The stages are:
Stage 1 - Prepare environment
More info
This stage is run automatically on startup and essentially sets up the operating environment.
Stage 2 - Obtain data
More info
This stage comprises the following steps:
The tables within the Raw data tab of the Data page will also be populated.
Stage 3 - Process data
More info
This stage comprises the following steps:
The tables within the Processed data tab of the Data page will also be populated.
Stage 4 - Exploratory data analysis
More info
This stage comprises the following steps:
The exploratory data figures of the Exploratory Data Analysis page will also be populated.
Stage 5 - Temporal analyses
More info
This stage comprises the following steps:
Underneath the sidebar menu there are a series of buttons that control progression through the analysis pipeline stages. When a button is blue (and has a play icon), it indicates that the Stage is the next Stage to be run in the pipeline. Once a stage has run, the button will turn green. Grey buttons are disabled.
Clicking on a button will run that stage. Once a stage has run, the button will change to either green (success), orange (warnings) or red (failure), indicating whether errors/warnings were encountered. If the stage completed successfully, the button corresponding to the next available stage will be activated.
Sidebar menu items that are in orange font are active and clicking on an active menu item will reveal an associated page. Inactive menu items are in grey font. Menu items will only become active once the appropriate run stage has been met. The following table lists the events that activate a menu item.
| Menu Item | Trigger Event |
|---|---|
| Landing | Always active |
| Dashboard | Always active |
| Data | After Stage 2 |
| Exploratory Data Analysis | After Stage 4 |
| Analysis | After Stage 5 |
| Manual | Always active |
To be valid, input data must be Excel files (*.xlsx) comprising at least the following sheets (each of which must at least have the fields listed in their respective tables):
metals
| Field | Description | Validation conditions |
|---|---|---|
| Sample_ID | unique sample ID | must contain characters |
| *¹ (mg/kg) | observed concentration of metal in sediment sample | must contain only numbers or start with a ‘<’ symbol |
1: where the ‘*’ represents a one or two character chemical symbol (such as ‘Ag’ or ‘V’). There will typically be numerous such fields
hydrocarbons
| Field | Description | Validation conditions |
|---|---|---|
| Sample_ID | unique sample ID | must contain characters |
| >C*¹ | observed concentration of hydrocarbons within a specific size bin in sediment sample | must contain only numbers or start with a ‘<’ symbol |
total_carbons
| Field | Description | Validation conditions |
|---|---|---|
| Sample_ID | unique sample ID | must contain characters |
| TOC (%) | observed total organic carbon (as a percentage of the sample weight) | must contain only numbers |
metadata
| Field | Description | Validation conditions |
|---|---|---|
| IBSM_site | name of the site from the perspective of IBSM | must contain characters (or be blank) |
| Site_ID | a unique site ID | must contain characters (cannot be blank) |
| Sample_ID | unique sample ID (the key to data sheets) | must contain characters (cannot be blank) |
| Original_SampleID | unique sample ID | must contain characters |
| Latitude | site latitude | must be numeric (and negative) |
| Longitude | site longitude | must be numeric |
| Acquire_date_time | date and time sample was collected (D/M/YYYY hh:mm:ss) | must be in datetime format |
| Sampler | name of person responsible for collecting sample (ignored) | ignored |
| Notes | project description (ignored) | ignored |
| Baseline_site | the unique site ID of the corresponding baseline sample | must contain characters (cannot be blank) |
| Baseline_acquire_date_site | the date and time of the corresponding baseline sample | must be in datetime format |
notes - this sheet is not processed or validated
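The "must contain only numbers or start with a '<' symbol" rule above can be illustrated with a small sketch. This is a hedged illustration only (the application's actual validation is performed in R and is not shown here): a concentration cell is accepted if it is a plain non-negative number or a below-detection-limit reading such as "<0.05".

```shell
# Hypothetical illustration of the concentration validation rule.
# Returns 0 (valid) for a plain number or a value starting with '<',
# and 1 (invalid) for anything else.
is_valid_conc() {
  case "$1" in
    \<*)          return 0 ;;  # censored value, e.g. "<0.05"
    ''|*[!0-9.]*) return 1 ;;  # empty, or contains a non-numeric character
    *)            return 0 ;;  # plain number, e.g. "1.23"
  esac
}
```

So "1.23" and "<0.05" would pass validation, while "abc" or an empty cell would be flagged.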
To run this tool, please adhere to the following steps:
The analysis pipeline comprises numerous Stages, each of which is made up of several more specific Tasks. The individual Tasks represent an action performed in furtherance of the analysis and for which there are reportable diagnostics. For example, once the application loads, the first Stage of the pipeline is to prepare the environment. The first Task in this Stage is to load the necessary R packages used by the codebase. Whilst technically this action consists of numerous R calls (one for each package that needs to be loaded), the block of actions is evaluated as a set.
Initially, all tasks are reported as “pending”. As the pipeline progresses, each Task is evaluated and its status is returned as either “success” or “failure”.
The Stage that is currently (or most recently) being run will be expanded, whereas all other Stages will be collapsed (unless they contain errors). It is also possible to expand/collapse a Stage by double clicking on its title (or the small arrow symbol at the left side of the tree).
As the pipeline progresses, Task logs are written to a log_file and echoed to the Logs panel. Each row represents the returned status of a specific Task and is formatted as:
- SUCCESS: the task succeeded
- FAILURE: the task failed and should be investigated
- WARNING: the task generated a warning - typically these can be ignored, as they are usually passed on from underlying routines and are targeted more at developers than users.

The Logs in the Log panel are presented in chronological order and will autoscroll so that the most recent log is at the bottom of the display. If the number of Log lines exceeds 10, a scroll bar will appear on the right side of the panel to help review earlier Logs.
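Because each log line begins with its status keyword, the log file can be summarised mechanically. The helper below is a hypothetical sketch (the log format is assumed from the description above, and this function is not part of the application): it tallies how many lines of each status appear in the log file.

```shell
# Hypothetical helper: counts SUCCESS/FAILURE/WARNING lines in a log file
# whose lines are assumed to begin with the status keyword.
summarise_log() {
  awk '/^SUCCESS/ {s++} /^FAILURE/ {f++} /^WARNING/ {w++}
       END {printf "success=%d failure=%d warning=%d\n", s, f, w}' "$1"
}
```

A quick `summarise_log path/to/log_file` gives an at-a-glance check that no Task failed without scrolling through the Logs panel.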
The Progress panel also has a tab (called Terminal-like) that provides an alternative representation of the status and progress of the pipeline.